Comparing Set-Covering Strategies for Optimal Corpus Design

Authors

Jonathan Chevelu

Nelly Barbot

Olivier Boëffard

Arnaud Delhay

Abstract

This article addresses the problem of the linguistic content of a speech corpus. Depending on the target task, the phonological and linguistic content of the corpus is controlled by collecting a set of sentences that covers a preset description of phonological attributes while keeping the overall duration as small as possible. This goal is classically achieved with greedy algorithms, which, however, do not guarantee that the resulting cover is optimal. In recent work, a Lagrangian-based algorithm called LamSCP was used to extract diphoneme coverings from a large French corpus, giving better results than a greedy algorithm. We propose to continue comparing the two algorithms in terms of shortest duration, stability, and robustness by computing multi-represented diphoneme or triphoneme coverings. These coverings correspond to very large-scale optimization problems, built from a corpus in English. In each experiment, LamSCP improves on the greedy results by between 3.9 and 9.7 percent.
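
To make the greedy baseline concrete, here is a minimal sketch of a greedy multi-represented covering, not the authors' implementation. It assumes each sentence is given as a list of units (for example diphoneme labels) with a duration, that every unit must be covered `k` times (or as often as the corpus allows), and that the selection rule is "most still-needed units per second". The function name `greedy_multicover`, the rule, and the toy data are all illustrative assumptions.

```python
from collections import Counter


def greedy_multicover(sentences, durations, k=1):
    """Greedy sketch: pick sentences until each unit is covered k times.

    sentences: dict mapping sentence id -> list of units (repeats allowed)
    durations: dict mapping sentence id -> positive duration
    k:         required number of occurrences per unit (assumed uniform)
    """
    counts = {sid: Counter(units) for sid, units in sentences.items()}

    # A unit cannot be required more often than it occurs in the corpus.
    total = Counter()
    for c in counts.values():
        total.update(c)
    need = {unit: min(k, n) for unit, n in total.items()}

    chosen = []
    remaining = set(counts)
    while any(need.values()):
        def score(sid):
            # Useful occurrences contributed by this sentence, per unit of time.
            gain = sum(min(need[u], n) for u, n in counts[sid].items())
            return gain / durations[sid]

        # sorted() makes tie-breaking deterministic (smallest id wins).
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
        for unit, n in counts[best].items():
            need[unit] = max(0, need[unit] - n)
    return chosen


# Toy corpus (hypothetical data, for illustration only).
sentences = {
    "s1": ["a-b", "b-c"],
    "s2": ["a-b"],
    "s3": ["b-c", "c-d"],
    "s4": ["a-b", "b-c", "c-d"],
}
durations = {"s1": 2.0, "s2": 1.5, "s3": 2.0, "s4": 2.5}

single_cover = greedy_multicover(sentences, durations, k=1)
double_cover = greedy_multicover(sentences, durations, k=2)
```

Because each step only looks at the best immediate ratio, the result can be longer than the optimal cover, which is the gap a Lagrangian-based method such as LamSCP aims to narrow.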

Similar resources

Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora

Set-covering algorithms are efficient tools for optimal linguistic corpus reduction. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article proposes to verify experimentally the behaviour of three algorithms, a greedy approach and a Lagrangian relaxation based one giving importance to rare events and a thi...

Lagrangian relaxation for optimal corpus design

This article is interested in the problem of the linguistic content of a speech corpus. Depending on the target task (speech recognition, speech synthesis, etc.) we try to control the phonological and linguistic content of the corpus by collecting an optimal set of sentences which make it possible to cover a preset description of phonological attributes (prosodic tags, allophones, syllables, etc.) un...

Design of an optimal continuous speech database for text-to-speech synthesis considered as a set covering problem

Text-to-speech synthesis can be carried out by concatenation of acoustic units obtained from a continuous speech database. This paper presents the optimization of such a database according to phonetic criteria. A large corpus of texts is assembled (311 572 sentences), phonetized automatically and condensed (12 217 sentences) to retain only 10 tokens of the most frequent triphonemes. This is a ...

Developing a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture

The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL); a list of the most frequent words from a corpus of 3,458,445 tokens made up of 800 most recent pharmacy texts including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...

Translation Strategies in English to Persian Translation of Children's Literature based on Klingberg's Model

This research sought to identify the translation strategies adopted by the translator in the Persian translation of 'whatever after, Fairest of all' written by 'Sarah Mlynowski', based on Klingberg's model (1986). To achieve the objectives of the study, a qualitative content analysis design was selected. The corpus of the study consisted of 60 pages of the novel 'whatever after, Fairest of al...

Publication date: 2008
